{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Example of usage model from sklift.models in sklearn.pipeline\n",
    "\n",
    "<br>\n",
    "<center>\n",
    "    <a href=\"https://colab.research.google.com/github/maks-sh/scikit-uplift/blob/master/notebooks/pipeline_usage_EN.ipynb\">\n",
    "        <img src=\"https://colab.research.google.com/assets/colab-badge.svg\">\n",
    "    </a>\n",
    "    <br>\n",
    "    <b><a href=\"https://github.com/maks-sh/scikit-uplift/\">SCIKIT-UPLIFT REPO</a> | </b>\n",
    "    <b><a href=\"https://scikit-uplift.readthedocs.io/en/latest/\">SCIKIT-UPLIFT DOCS</a> | </b>\n",
    "    <b><a href=\"https://scikit-uplift.readthedocs.io/en/latest/user_guide/index.html\">USER GUIDE</a></b>\n",
    "    <br>\n",
    "    <b><a href=\"https://nbviewer.jupyter.org/github/maks-sh/scikit-uplift/blob/master/notebooks/pipeline_usage_RU.ipynb\">RUSSIAN VERSION</a></b>\n",
    "\n",
    "</center>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-04-26T12:44:35.435852Z",
     "start_time": "2020-04-26T12:44:35.239050Z"
    }
   },
   "source": [
    "This is a simple example on how to use [sklift.models](https://scikit-uplift.readthedocs.io/en/latest/api/models.html) with [sklearn.pipeline](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.pipeline).\n",
    "\n",
    "The data is taken from [MineThatData E-Mail Analytics And Data Mining Challenge dataset by Kevin Hillstrom](https://blog.minethatdata.com/2008/03/minethatdata-e-mail-analytics-and-data.html).\n",
    "\n",
    "This dataset contains 64,000 customers who last purchased within twelve months. The customers were involved in an e-mail test:\n",
    "* 1/3 were randomly chosen to receive an e-mail campaign featuring Mens merchandise.\n",
    "* 1/3 were randomly chosen to receive an e-mail campaign featuring Womens merchandise.\n",
    "* 1/3 were randomly chosen to not receive an e-mail campaign.\n",
    "\n",
    "During a period of two weeks following the e-mail campaign, results were tracked. The task is to tell the world if the Mens or Womens e-mail campaign was successful.\n",
    "\n",
    "The full description of the dataset can be found at the [link](https://blog.minethatdata.com/2008/03/minethatdata-e-mail-analytics-and-data.html).\n",
    "\n",
    "Firstly, install the necessary libraries:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2021-02-07T01:01:39.897817Z",
     "start_time": "2021-02-07T01:01:39.890409Z"
    }
   },
   "outputs": [],
   "source": [
    "!pip install scikit-uplift xgboost==1.0.2 category_encoders==2.1.0 -U"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For simplicity of the example, we will leave only two user segments:\n",
    "* those who were sent an e-mail advertising campaign with women's products;\n",
    "* those who were not sent out the ad campaign.\n",
    "\n",
    "We will use the `visit` variable as the target variable."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2021-02-07T01:01:42.438253Z",
     "start_time": "2021-02-07T01:01:39.901510Z"
    },
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Shape of the dataset before processing: (64000, 8)\n",
      "Shape of the dataset after processing: (42693, 8)\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>recency</th>\n",
       "      <th>history_segment</th>\n",
       "      <th>history</th>\n",
       "      <th>mens</th>\n",
       "      <th>womens</th>\n",
       "      <th>zip_code</th>\n",
       "      <th>newbie</th>\n",
       "      <th>channel</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>10</td>\n",
       "      <td>2) $100 - $200</td>\n",
       "      <td>142.44</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>Surburban</td>\n",
       "      <td>0</td>\n",
       "      <td>Phone</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>6</td>\n",
       "      <td>3) $200 - $350</td>\n",
       "      <td>329.08</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>Rural</td>\n",
       "      <td>1</td>\n",
       "      <td>Web</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>7</td>\n",
       "      <td>2) $100 - $200</td>\n",
       "      <td>180.65</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>Surburban</td>\n",
       "      <td>1</td>\n",
       "      <td>Web</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>2</td>\n",
       "      <td>1) $0 - $100</td>\n",
       "      <td>45.34</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>Urban</td>\n",
       "      <td>0</td>\n",
       "      <td>Web</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>6</td>\n",
       "      <td>2) $100 - $200</td>\n",
       "      <td>134.83</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>Surburban</td>\n",
       "      <td>0</td>\n",
       "      <td>Phone</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   recency history_segment  history  mens  womens   zip_code  newbie channel\n",
       "0       10  2) $100 - $200   142.44     1       0  Surburban       0   Phone\n",
       "1        6  3) $200 - $350   329.08     1       1      Rural       1     Web\n",
       "2        7  2) $100 - $200   180.65     0       1  Surburban       1     Web\n",
       "4        2    1) $0 - $100    45.34     1       0      Urban       0     Web\n",
       "5        6  2) $100 - $200   134.83     0       1  Surburban       0   Phone"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import pandas as pd\n",
    "from sklift.datasets import fetch_hillstrom\n",
    "\n",
    "\n",
    "%matplotlib inline\n",
    "\n",
    "bunch = fetch_hillstrom(target_col='visit')\n",
    "\n",
    "dataset, target, treatment = bunch['data'], bunch['target'], bunch['treatment']\n",
    "\n",
    "print(f'Shape of the dataset before processing: {dataset.shape}')\n",
    "\n",
    "# Selecting two segments\n",
    "dataset = dataset[treatment!='Mens E-Mail']\n",
    "target = target[treatment!='Mens E-Mail']\n",
    "treatment = treatment[treatment!='Mens E-Mail'].map({\n",
    "    'Womens E-Mail': 1,\n",
    "    'No E-Mail': 0\n",
    "})\n",
    "\n",
    "print(f'Shape of the dataset after processing: {dataset.shape}')\n",
    "dataset.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Divide all the data into a training and validation sample:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2021-02-07T01:01:42.579775Z",
     "start_time": "2021-02-07T01:01:42.442595Z"
    }
   },
   "outputs": [],
   "source": [
    "from sklearn.model_selection import train_test_split\n",
    "\n",
    "\n",
    "X_tr, X_val, y_tr, y_val, treat_tr, treat_val = train_test_split(\n",
    "    dataset, target, treatment, test_size=0.5, random_state=42\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Select categorical features:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2021-02-07T01:01:42.600915Z",
     "start_time": "2021-02-07T01:01:42.585066Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['history_segment', 'zip_code', 'channel']\n"
     ]
    }
   ],
   "source": [
    "cat_cols = X_tr.select_dtypes(include='object').columns.tolist()\n",
    "print(cat_cols)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Create the necessary objects and combining them into a pipieline:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2021-02-07T01:01:42.703537Z",
     "start_time": "2021-02-07T01:01:42.603875Z"
    }
   },
   "outputs": [],
   "source": [
    "from sklearn.pipeline import Pipeline\n",
    "from category_encoders import CatBoostEncoder\n",
    "from sklift.models import ClassTransformation\n",
    "from xgboost import XGBClassifier\n",
    "\n",
    "\n",
    "encoder = CatBoostEncoder(cols=cat_cols)\n",
    "estimator = XGBClassifier(max_depth=2, random_state=42)\n",
    "ct = ClassTransformation(estimator=estimator)\n",
    "\n",
    "my_pipeline = Pipeline([\n",
    "    ('encoder', encoder),\n",
    "    ('model', ct)\n",
    "])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<b>Train pipeline as usual, but adding the treatment column in the step model as a parameter `model__treatment`.</b>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2021-02-07T01:01:44.020040Z",
     "start_time": "2021-02-07T01:01:42.707311Z"
    }
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/Users/Maksim/Library/Python/3.6/lib/python/site-packages/sklearn/pipeline.py:354: UserWarning: It is recommended to use this approach on treatment balanced data. Current sample size is unbalanced.\n",
      "  self._final_estimator.fit(Xt, y, **fit_params)\n"
     ]
    }
   ],
   "source": [
    "my_pipeline = my_pipeline.fit(\n",
    "    X=X_tr,\n",
    "    y=y_tr,\n",
    "    model__treatment=treat_tr\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-04-26T18:07:44.970856Z",
     "start_time": "2020-04-26T18:07:44.964624Z"
    }
   },
   "source": [
    "Predict the uplift and calculate the uplift@30%"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2021-02-07T01:01:44.184968Z",
     "start_time": "2021-02-07T01:01:44.047865Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "uplift@30%: 0.0661\n"
     ]
    }
   ],
   "source": [
    "from sklift.metrics import uplift_at_k\n",
    "\n",
    "\n",
    "uplift_predictions = my_pipeline.predict(X_val)\n",
    "\n",
    "uplift_30 = uplift_at_k(y_val, uplift_predictions, treat_val, strategy='overall')\n",
    "print(f'uplift@30%: {uplift_30:.4f}')"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.1"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
